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Abstract 

Background: Repetitive elements comprise at least 55% of the human genome with more recent estimates as 
high as two-thirds. Most of these elements are retrotransposons, DNA sequences that can insert copies of 
themselves into new genomic locations by a "copy and paste" mechanism. These mobile genetic elements play 
important roles in shaping genomes during evolution, and have been implicated in the etiology of many human 
diseases. Despite their abundance and diversity, few studies investigated the regulation of endogenous 
retrotransposons at the genome-wide scale, primarily because of the technical difficulties of uniquely mapping 
high-throughput sequencing reads to repetitive DNA. 

Results: Here we develop a new computational method called RepEnrich to study genome-wide transcriptional 
regulation of repetitive elements. We show that many of the Long Terminal Repeat retrotransposons in humans are 
transcriptionally active in a cell line-specific manner. Cancer cell lines display increased RNA Polymerase II binding 
to retrotransposons than cell lines derived from normal tissue. Consistent with increased transcriptional activity of 
retrotransposons in cancer cells we found significantly higher levels of L1 retrotransposon RNA expression in 
prostate tumors compared to normal-matched controls. 

Conclusions: Our results support increased transcription of retrotransposons in transformed cells, which may 
explain the somatic retrotransposition events recently reported in several types of cancers. 
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Background 

The initial sequencing of the human genome revealed 
that -55% of the genome is comprised of repetitive 
DNA sequences [1]. More recent computational ap- 
proaches indicate the proportion of repetitive elements 
in the human genome may be as high as two-thirds [2]. 
Identified repetitive DNA sequences can be character- 
ized using five broad categories. Four minor categories, 
accounting for -10% of genomic DNA, include simple 
sequence repeats, segmental duplications, tandem re- 
peats and satellite DNA sequences, and processed pseu- 
dogenes. The fifth category is transposable elements, 
accounting for -45% of genomic DNA and is primarily 
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composed of retrotransposons. Retrotransposable ele- 
ments (RTEs) are parasitic DNA sequences that can 
proliferate by a "copy and paste" mechanism and insert 
themselves into new genomic positions. RTEs are classi- 
fied into Long Terminal Repeat (LTR) elements, whose 
structure and mechanism of retrotransposition resem- 
bles that of retroviruses, and non-LTR elements, which 
do not contain LTRs, resemble integrated mRNAs, and 
have a distinct mechanism of retrotransposition [1], In 
humans only the non-LTR elements are believed to be 
capable of retrotransposition, and can be classified as ei- 
ther Long Interspersed Nuclear Elements (LINEs) or 
Short Interspersed Nuclear Elements (SINEs) [3]. They 
are predominantly represented by the LI and Alu fam- 
ilies, respectively. The process of retrotransposition re- 
quires the transcription of an mRNA intermediate and 
its reverse transcription into cDNA, and can lead to the 
disruption of genes by insertional mutagenesis. Retro- 
transposition occurs de novo in the germ-line and can 
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cause single-gene mutations that result in disease, an ex- 
ample being hemophilia A [4]. The LI protein machinery 
may also retrotranspose copies of genes and structural 
non-coding RNAs yielding processed pseudogenes. 

The majority of our understanding of retrotransposon 
transcription and function comes from studies of single el- 
ements and their DNA sequence, primarily autonomous 
elements capable of active retrotransposition such as the 
LIHs retrotransposon (a human- specific LI subfamily) or 
non-autonomous elements such as Alu that can retrotran- 
spose in trans using the LI protein machinery. These 
studies revealed that endogenous retrotransposons are re- 
pressed in human cells under normal conditions, predom- 
inantly via silencing by promoter DNA methylation [5]. 
However, when retrotransposons are expressed, such as in 
response to cellular stress, Alu is thought to be transcribed 
by RNA polymerase III (Pol III), and LI by RNA polymer- 
ase II (Pol II) from an internal promoter [5]. 

Few studies have attempted to survey transposable elem- 
ent transcription genome-wide. High throughput sequen- 
cing data poses a challenge to these studies due to the 
ambiguity in assigning short reads mapping to more than 
one genomic location (referred to here as multi-mapping 
reads). Application-specific strategies have been developed 
to recover multi-mapping reads, such as assignment of 
Cap Analysis Gene Expression (CAGE) reads to the most 
represented Transcriptional Start Site (TSS) in CAGE se- 
quencing data [6], a method to identify TSS. A genome- 
wide analysis of retrotransposon expression using CAGE 
data revealed that repetitive elements are expressed in the 
mouse in a tissue-specific manner [7]. 

More recent attempts to address systematically the 
ambiguity in read assignment have followed two comple- 
mentary strategies. The first attempts to include multi- 
mapping reads in computing the read coverage across the 
genome by either assigning reads proportionally to all 
matching regions [8,9], or by assigning them probabilistic- 
ally to a specific location based on the local genomic tag 
context [10]. The second strategy addresses the ambiguity 
in read mapping by assigning them to subfamilies of re- 
petitive elements as opposed to their specific locations 
across the genome. Early examples estimated repetitive 
element enrichment by mapping short read data to con- 
sensus sequences [11,12]. However, this approach did not 
account for the majority of genomic instances, many of 
which deviate from the consensus sequence. A more re- 
cent example of the second approach incorporated both 
consensus and genomic instances in the analysis but ex- 
cluded reads aligning to more than a single repetitive 
element subfamily [13]. Because individual repetitive 
element subfamilies are highly conserved within their fam- 
ilies, this latter approach excluded a significant fraction of 
mapping reads from the analysis. For example, the L1PA2 
and L1PA3 subfamilies have a high degree of homology; 



many reads mapping to one of these two subfamilies also 
map to the other and would be excluded. 

In this study we extend these approaches to quantify re- 
petitive element enrichment by utilizing all mapping reads 
in estimating read counts. The resulting computational 
pipeline, RepEnrich, was integrated with existing computa- 
tional tools to test for differential enrichment between two 
or more experimental conditions. We report here the re- 
sults of a whole-genome analysis of the transcription and 
regulation of repetitive elements, obtained by applying 
RepEnrich to both RNA-seq and ChlP-seq datasets for 
RNA Pol II, Pol III and associated transcription factors in 
a panel of human cell lines, as well as several chromatin 
activation and repression marks [14-20]. Finally, we iden- 
tify transposable elements overexpressed in tumor tissue 
collected from prostate cancer patients [21]. 

Results 

Comprehensive assessment of repetitive element 
enrichment 

In RepEnrich, reads are initially aligned to the unmasked 
genome and divided into uniquely mapping and multi- 
mapping reads. Uniquely mapping reads are tested for 
overlap with repetitive elements, while multi-mapping 
reads are separately aligned to repetitive element assem- 
blies representing individual repetitive element subfamilies 
(Figure 1). Repetitive element assemblies are represented 
by all genomic instances (assembled from the RepeatMasker 
annotation) of an individual repetitive element subfam- 
ily, including flanking genomic sequences, concatenated 
with spacer sequences to avoid spurious mapping of reads 
spanning multiple instances. The repetitive element as- 
semblies are an extension of the strategy used by Day 
et al [13], which however only used reads that could be 
unambiguously assigned to an individual subfamily. 

By combining the counts from uniquely mapping 
reads and multi-mapping reads RepEnrich keeps track of 
all repetitive elements that every read aligns to and sys- 
tematically estimates enrichment from all mapping reads. 
Using this strategy we can compute read abundance in 
three different ways. First, we can compute the total num- 
ber of reads mapping to each repetitive element subfamily 
(Additional file 1: Figure SI A), which we refer to as total 
counts. Second, we can compute the total number of reads 
mapping exclusively to a single repetitive element sub- 
family. This methodology is similar to the one used Day 
et al and we refer to it as unique counts (Additional file 1: 
Figure SIB). Third, we can count reads that map to a sin- 
gle repetitive element subfamily assembly once and assign 
reads that map to multiple subfamilies using a fractional 
value 1/N S (where N s is the number of repetitive element 
subfamily assemblies the read maps to), which we call 
fractional counts (Additional file 1: Figure SIC). 
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Figure 1 RepEnrich read mapping strategy. Reads are mapped to the genome using the Bowtiel aligner. Reads mapping uniquely to the 
genome are assigned to subfamilies of repetitive elements based on their degree of overlap to RepeatMasker annotated genomic instances of 
each repetitive element subfamily. Reads mapping to multiple locations are separately mapped to repetitive element assemblies - referred to as 
repetitive element psuedogenomes - built from RepeatMasker annotated genomic instances of repetitive element subfamilies. 



To investigate how these three counting strategies dif- 
fered in their ability to estimate read abundance, we 
used in silico generated ChlP-seq data. The ChlP-seq 
data simulators currently available [22] cannot modu- 
late the sampling rate of reads at specific loci in the gen- 
ome. Hence, we developed a general-purpose Hidden 
Markov Model (HMM) ChlP-seq simulator that can gen- 
erate sample reads at user-defined emission rates from spe- 
cified genomic loci. We simulated ChlP-seq and input data 
in triplicates for whole human chromosomes to represent 
scenarios in which different families of repetitive elements 
were enriched. To assess the generality of our results our 
simulations used different chromosomes, read lengths, 
and families of enriched repetitive elements (LI, Alu, and 
SVA). The HMM structure and parameters used in our 
simulations are described in Additional file 1: Figure S2. 
Additional file 1: Figure S3 shows a representative read 
alignment in a 35 kb region of chromosome 19 for a simu- 
lation in which retro transposons in the LI family were 
enriched with respect to background. 

In all our simulations we applied RepEnrich to com- 
pute the abundance, expressed in counts per million 
mapping reads (CPM), for all repetitive elements based 
on the three counting strategies. Because we knew the 
exact chromosomal location each read was sampled 
from in the simulation, we could unambiguously com- 
pute the true abundance of each repetitive element. 

Our simulations revealed clear differences in the per- 
formance of the three counting strategies. Figures 2A-C 
and Additional file 1: Figure S4A-C show the scatterplots 



of the RepEnrich CPM estimate versus the true abun- 
dance CPM for all repetitive element subfamilies the LI 
enrichment and Alu enrichment simulations respectively. 
The unique counting strategy (Figure 2A and Additional 
file 1: Figure S4A) tends to over- or under-estimate the 
true abundance of repetitive elements and thus intro- 
duces the most variance to the estimate. In addition, 
specific families of repetitive elements show a common 
bias; most notably SINEs are consistently underesti- 
mated. The total counting strategy performs better over- 
all but suffers from a strong bias in a few families of 
repetitive elements, such as SINEs and SINE-Variable 
Number Tandem Repeat-Alus (SVAs) elements, which 
are consistently overestimated (Figure 2B and Additional 
file 1: Figure S4B). The fractional counting strategy ap- 
pears to provide the optimal estimate: deviation from the 
true abundance is smallest for all subfamilies (Figure 2C 
and Additional file 1: Figure S4C). The largest deviations 
occurred for elements with smaller CPM values. Although 
some of the family-specific biases in the total counts are 
still present, they are greatly reduced and limited to ele- 
ments with low CPM values. 

We confirmed these observations by conducting three 
additional comparative analyses. First we applied multi- 
dimensional scaling to the four vectors containing the 
unique, total, fractional and true CPMs respectively 
(Figure 2D and Additional file 1: Figure S4D). The frac- 
tional count strategy is more similar to the true abun- 
dance as demonstrated by the smaller distance between 
these two in the multi-dimensional scaling plot. 
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Figure 2 Performance comparison of counting strategies on simulated LI -enriched data. Three replicates of ChlP-seq (50 bp single-end 
reads) data enrichment at L1 elements on chromosome 19 were simulated using the hidden Markov model (HMM) in Additional file 1: Figure S2. 
The expected average log2CPM for the simulation was computed using the repetitive element counts computed from the true read coordinates. 
The average log2CPM read abundances, computed by EdgeR from RepEnrich estimated count values using total, unique, and fractional count 
methods were compared to the expected true abundance. The solid line indicates y = x, values falling on the line are identical between the 
estimated average log2CPM and expected average log2CPM. The repetitive element subfamilies are colored according to class with small RNA 
repeats including scRNA, rRNA, snRNA, and tRNA classes. A) Comparison of the estimated abundance from the unique count method, which only 
sums reads that can be assigned uniquely to a single subfamily of repetitive elements, versus the true abundance. B) Comparison of the 
estimated abundance from the total count method, which sums the reads assigned to each repetitive element subfamily and allows for multiple 
counting of reads, versus the true abundance. C) Comparison of the estimated abundance from the fractional count method, which sums the 
reads that fall into each individual repetitive element subfamily once, but adds a fraction for reads mapping to more than one subfamily 

of repetitive element sub-families aligned), versus the true abundance. D) Multidimensional scaling (MDS) plot of the Euclidean distances 
between the average log2CPM values for the unique, total, and fractional count estimates of RepEnrich and the expected average log2CPM values. 
The fractional count average log2CPM estimate was closest to the true abundance. 



Next, we computed the deviation from the 45° line in Additional file 1: Figures S5A and S6A show the R-squared 
the scatterplots of estimated abundance vs. true abun- value for all elements combined and for each repetitive 
dance (Additional file 1: Figures S5B-D and S6B-D). element class separately in two sets of LI enrichment 
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simulations in with 2 M reads for two different chromo- 
somes. The R-squared values in the fractional count strat- 
egy were consistently close to 1 only in the case of the 
fractional count, and varied widely between 0 and 1 for the 
unique counts strategy. Comparison with the scatterplots 
in Figure 2, which were obtained from simulating 20 M 
reads, also indicate that the unique count strategy is more 
affected by read coverage than the other two methods and 
performs poorly at lower coverage. 

Finally, we assessed the ability of the various counting 
methods to reveal significant differences between two ex- 
perimental samples. To do so we compared the SVA, LI, 
and Alu enriched samples to their input samples via a 
Generalized Linear Model (GLM) fit to a negative bino- 
mial distribution (see Methods). We created a benchmark 
set of differentially enriched elements in each simulation 
by applying the GLM model to the true abundance counts, 
and compared this set to the elements detected as differen- 
tially enriched in each of the three counting strategies. In 
all simulations the fractional counting method recovered 
numerous significant repetitive elements that were identi- 
fied to be differentially enriched in the true abundance 
comparison benchmark; it also returned the least number 
of false positive (Additional file 1: Figures S7 and S8). 

To assure our observations were not restricted to in 
silico data we compared the performance of the fractional 
counting and unique counting methods on real ChlP-seq 
data. We utilized a ChlP-seq dataset for RNA polymerase 
II (Pol II) conducted in K562 cell-line (Additional file 2: 
Table SI), and applied the GLM to identify repetitive ele- 
ments enrichment for Pol-II with respect to input. Con- 
sistently with our simulations, the fractional counting 
method identified more elements as enriched for Pol-II 
with respect to the unique counting method (Additional 
file 1: Figure S9). 

Because the fractional counts displayed the least bias 
and variance in the estimation of repetitive element 
abundance, and was most similar to the true abundance, 
we chose to use fractional counts as the default counting 
strategy for RepEnrich. 

Experimental design 

To investigate transcription of different classes of repeti- 
tive elements in human cells we applied RepEnrich to a 
collection of publicly available RNA-seq and ChlP-seq 
datasets. We collected high-throughput sequencing data 
from the ENCyclopedia Of DNA Elements (ENCODE), 
the Gene Expression Omnibus (GEO) and the European 
Nucleotide Archive (ENA). A detailed list of individual 
samples can be found in Additional file 2: Table SI. 

Transcription in eukaryotes is performed by three dif- 
ferent RNA Polymerases, Pol I-IIL With the exception of 
Pol I, which specializes in ribosomal RNA (rRNA) tran- 
scription, both Pol II and III are known to transcribe 



repetitive elements [5,23]. Hence, to address the question 
of how repetitive elements are transcribed we utilized 
ChlP-seq data for Pol II, Pol III, and TFIIIB (a Pol Ill- 
associated transcription factor complex). ChlP-seq for 
TFIIIB subunits has previously been used as additional 
support of Pol III binding, because TFIIIB is necessary for 
Pol III promoter recognition [15,24]. For Pol II, we ana- 
lyzed ChIP datasets generated with a Pol II antibody that 
does not distinguish active and inactive enzymes, as well 
as an antibody to Pol II phosphorylated on serine 2 (Pol 
II S2), which is specific for the active elongating en- 
zyme. To our knowledge, no ChlP-seq dataset for RNA 
Pol I is currently available. 

We adopted a comprehensive approach that included 
the analysis of not only transposable elements, but also 
other classes of repetitive elements annotated within 
RepeatMasker, with the exclusion of simple sequence re- 
peats. Among the repetitive element classes we examined 
for Pol II and Pol III binding were the small structural 
RNAs and their processed pseudogenes. Small structural 
RNAs including tRNAs, snRNAs, and rRNAs are included 
in the Repeatmasker annotation because of the high de- 
gree of sequence homology to processed psuedogenes. 
Previous Pol III ChlP-seq studies indicated that the tRNA 
pseudogenes are occupied by Pol III [24], which is not sur- 
prising since tRNA Pol III promoters are internal. Some 
Pol II transcribed snRNA psuedogenes may also be tran- 
scribed, and have been found to be associated with Ll- 
encoded proteins [25]. 

To investigate the transcription and regulation of repeti- 
tive elements in a variety of cell types, we collected data 
from multiple cell lines. Specifically, our analysis included 
Pol II and III ChlP-seq performed with IMR-90 fibroblasts, 
K562 chronic myelogenous leukemia (CML) cells, HeLa 
adenocarcinoma cells and GM12878 lymphoblastoid cells, 
as well as additional RNA Pol Il-only ChlP-seq data for 
HUVEC (human umbilical vein) endothelial cells and per- 
ipheral blood-derived erythroblast cells (PBDE) purified 
from human blood samples (Additional file 2: Table SI) 
[15-17,24]. K562 and HeLa are cancer-derived transformed 
cell lines, GM12878 is an EBV- immortalized cell line, and 
IMR-90, HUVEC and PDBE are normal (non-immortal- 
ized) cells. 

Regulation and transcription of repetitive elements in 
human cells 

All ChlP-seq and RNA-seq data were processed with 
RepEnrich to generate counts for all repetitive element 
subfamilies. Log2-fold-changes between ChIP and input 
samples as well as the statistical significance were then 
evaluated using a generalized linear model (GLM) fit to 
a negative binomial distribution (see Methods). 

The repetitive elements that displayed the most shared 
pattern of Pol II binding between the cell lines were the 
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snRNAs (Figure 3A, B). This is consistent with the known 
universal role of snRNAs in RNA processing and their 
transcription by Pol II (with the exception of the U6 
snRNA, which is a Pol III transcript). Likewise, we ob- 
served ubiquitous Pol III binding to tRNAs and the 5S 
rRNA across all the cell lines examined (Figure 3C). 
Transposable elements rarely displayed consistency across 
all cell lines, and instead primarily displayed significant 
Pol II or Pol III binding in one or a few cell lines (Figure 3). 
This is at least partially explained by a tendency of retro- 
transposable elements to be expressed more highly in 
transformed versus normal (non-transformed) cell lines. 

One interesting feature we identified was significant 
co-occupancy of Pol II and Pol III at some repetitive ele- 
ments. When overlapped within the same cell line, 89 
repetitive element subfamilies were co-occupied by Pol 
II and Pol III (Figure 3D). The majority of these repeti- 
tive elements were tRNAs (Figure 3D). Because tRNAs 
are short and the reads near their borders map uniquely 
to the genome, we could examine tRNA elements in the 



genome browser for further evidence of Pol II and Pol 
III co-occupancy. We identified multiple instances where 
Pol II and Pol III were bound at or near the same tRNA 
gene (Additional file 1: Figure S10 A), and instances 
where snRNA genes, including Pol III transcribed U6, 
were co-occupied by Pol II and Pol III (Additional file 1: 
Figure S10 B). 

To further characterize the transcription of repetitive 
elements, we examined binding of Pol Ill-specific tran- 
scription factors, as well as RNA transcript subcellular 
localization and polyadenylation status. The tRNA tran- 
scriptional signature displayed evidence of Pol III tran- 
scription from a type II promoter (Additional file 1: 
Figure Sll, SI 2). As expected, the tRNA transcripts 
were localized to the cytosol and not polyadenylated 
(Additional file 1: Figure SI 2). Satellite repeat sequences 
displayed predominantly Pol II binding, although some 
subfamilies also displayed Pol III binding (Additional 
file 1: Figure SI 2). Satellite RNAs were predominantly nu- 
clear and polyadenylated, consistent with being primarily 
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Figure 3 RNA polymerase binding patterns to repetitive elements. A) RNA Pol II, B) active RNA Pol II S2, and C) RNA Pol III were assessed for 
binding to repetitive elements using generalized linear model (GLM) comparisons of ChIP versus input. To view the binding patterns we 
examined percent of repetitive element sub-families for the major classes of repetitive elements that displayed significant (FDR <0.05) positive 
enrichment (Log2FC >0). The color-coding corresponds to the number of cell lines that displayed the significant positive enrichment. The x-axis 
labels the class of repetitive element and the adjacent number indicates how many repetitive element sub-families fall within that class. D) The 
repetitive elements that displayed significant (FDR <0.05) positive enrichment (Log2FC >0) for RNA Pol II and RNA Pol III were compared for 
overlap across the same cell line. The 89 repetitive elements that displayed co-enrichment within the same cell line for RNA Pol II and RNA Pol III 
were then examined for representation of the major classes of repetitive elements, expressed as a percent. 
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Pol II transcripts (Additional file 1: Figure S12). The major- 
ity of transposable elements did not display strong tran- 
scriptional signatures (Additional file 1: Figure SI 2, SI 3). 
Most notably, LINE retrotransposons, the major active 
class of retrotransposons in humans, displayed very few 
subfamilies with significant binding of Pol II or Pol III 
(Additional file 1: Figure SI 3). DNA transposable elements, 
which are believed to be inactive in the human genome, 
also displayed few subfamilies with Pol II or Pol III enrich- 
ment (Additional file 1: Figure SI 3). 

SINE elements, predominantly represented by Alu sub- 
families, displayed some genome-wide enrichment for Pol 
II and III binding; the Pol II binding may be due to the 
high representation of Alus within gene introns (Additional 
file 1: Figure S12). Similar to Canella et al we observed sig- 
nificant binding of Pol III to SINE elements, likely repre- 
senting independent SINE transcription [15]. SINE RNAs 
displayed a cytosolic and non-polyadenylated enrichment 
pattern, which is consistent with SINE elements being 
transcribed from internal Pol III promoters (Additional 
file 1: Figure S12) [5]. By far the most transcriptionally ac- 
tive endogenous retrotransposons we observed were in 
the LTR family (Additional file 1: Figure S13). Many LTR 
elements displayed significant binding by Pol II, and some 
also displayed enrichment for Pol III (Figure 3, Additional 
file 1: Figure S13). As noted above, the majority of LTR 
retrotransposon subfamilies that displayed polymerase 
binding did so in one or a few cell lines. 

The endogenous retrovirus HERV-Fcl is actively 
transcribed by Pol II in a CML cell line 

Among the LTR elements, numerous elements displayed 
Pol II enrichment that was significant in at least one 
cell-line (FDR < 0.05, for a full list see Additional file 1: 
Figure S14). One element that displayed a particularly 
striking binding was the internal portion of HERV-Fcl, 
most prominently in K562 CML cell-line. We chose to 
focus on K562 cell-line for additional analysis because 
the internal region of HERV-Fcl displayed 7- and 15- 
fold enrichment for Pol II and Pol II S2 in this cell-line 
(Figure 4A). To further examine the behavior of HERV- 
Fcl in K562 cells we applied RepEnrich to ENCODE 
ChlP-seq data for histone marks associated with active 
euchromatin (H3K27ac, H3K4me2, H3K9ac, H3K4me3, 
H3K79me2, H3K4mel, H3K36me2) and repressed het- 
erochromatin (H3K9mel, H3K9me3, H3K27me3). We 
found that the HERV-Fcl element, especially its internal 
region, was highly enriched for marks associated with 
active transcription and depleted for marks associated 
with repression (Figure 4B). These results indicate dere- 
pression of the HERV-Fcl retrotransposon in the K562 
CML cell line. 

The HERV-Fcl subfamily is represented by few copies in 
the human genome, and its internal region, HERV-Fcl-int, 



has only seven copies in the hgl9 build. We therefore ex- 
amined all the genomic loci of HERV-Fcl in K562 cells 
using the UCSC genome browser and ENCODE tracks. A 
single HERV-Fcl internal element on chromosome 7 dis- 
played RNA expression from the minus strand in K562 
cells but not in any other ENCODE cell line for which 
PolyA + RNA-seq is available (Figure 4C). This region also 
displayed binding for Pol II and active Pol II-S2 as well as 
the TATA-box binding protein (TBP). We also noted the 
binding of MAFK, MAFF, and NFE2 transcription factors 
at the promoter of the HERV-Fcl element. 

LI retrotransposons are significantly overexpressed in 
prostate tumor tissue 

Somatic retrotransposition events were recently reported 
in several cancers [26]. We therefore examined normal and 
transformed cell lines for Pol II binding and tested whether 
transformed cells displayed more permissive binding of Pol 
II to retrotransposons. Our results indicated that a larger 
number of transposable elements show at least 1.5-fold en- 
richment for Pol II in HeLa, K562, and GM12878 trans- 
formed cells than in PBDE, IMR90, and HUVEC normal 
cells (Figure 4D). This is especially true for LTR retro- 
transposons. Hierarchical clustering of Pol II binding 
for LTR elements with significant enrichment in at least 
one cell line revealed that normal cell-lines clustered 
separately from cancer and transformed cells (Additional 
file 1: Figure S14). We thus wanted to investigate further 
whether the increased Pol II binding in transformed cell 
lines contributed to increased expression of transposable 
elements. However, in this dataset the transformed and 
normal cells were derived from a variety of tissues and 
hence direct comparison of retrotransposon transcription 
was not possible. 

To better control for individual and tissue-specific ex- 
pression differences, we tested our hypothesis using a 
RNA-seq tumor dataset that contains data for matched 
prostate tumor and normal tissue from 14 patients with 
different grades of prostate cancer [21]. We detected 475 
retrotransposon subfamilies that exhibited significant dif- 
ferential expression in tumor tissue (FDR < 0.05), preva- 
lently from the LTR, LINE and DNA classes (Figure 5A). 
Interestingly, very few SINE subfamilies were differentially 
expressed in prostate tumor versus normal tissue. Most of 
the LTR subfamilies were endogenous retroviruses, with 
ERV1 being the most represented. Of the ERV1 family, 53 
subfamilies were overexpressed and 51 subfamilies were 
under-expressed (Figure 5D). Most of the differentially 
expressed DNA elements belonged to the hAT-Charlie and 
TcMar-Tigger families, and the vast majority of them (59 
out of 66) were significantly under-represented in tumor 
tissue (Figure 5B). For LINEs, 99 out of 107 subfamilies 
belonged to the LI family of retrotransposons, and the vast 
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Figure 4 HERV-Fcl and Pol II binding in transformed vs. normal cell lines. LTR and other transposable elements displayed differences in 
RNA Pol II binding in transformed versus normal cell lines. A) The LTR subfamily HERV-FC1 displayed cell line specific transcriptional profiles for 
the LTRs (LTR1-3) or internal region (int) of HERV-FC1 . The GLM results are plotted as log2FCs for Pol II enrichment and differential RNA-seq 
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D) The count of transposable elements displaying modest positive enrichment, log2FC >1.5, in transformed versus normal cell lines. The counts 
are colored by the class of transposable element. 



majority of these (97 out of 99) were overexpressed in 
tumor tissue (Figure 4C). 

LI LINEs are the most active retrotransposons in 
humans and their retrotransposition was recently docu- 
mented in multiple cancers [26-28]. Figure 6A shows a 
heatmap of the log2 fold changes between tumor and nor- 
mal tissue for evolutionarily recent primate and human- 



specific LI subfamilies that displayed statistically signifi- 
cant differences. We applied bi-clustering (see Figure le- 
gend) and identified two major groups of patients. Group 
1 showed a marked overexpression of the primate-specific 
Lis, while group 2 showed a lower level of overexpression 
and in some cases underrepresentation. Patient 8 appeared 
to be an outlier. We studied the association of these two 
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Figure 5 Repetitive elements differentially expressed in prostate cancer tissue. (A) Classes and families of repetitive elements differentially 
expressed in prostate cancer tumor tissue versus normal tissue. The number next to each class and family name corresponds to the number of 
differentially expressed subfamilies (FDR < 0.05). (B-D) Expression fold-change between prostate cancer tumor tissue and normal tissue computed 
by the GLM on the 14 patients. The most represented family of DNA, LINE and LTR elements are shown. 



groups with some clinical parameters available for each 
patient [21], We detected no association with patient 
age or preoperative PSA, but a significant association 
with the stage of the cancer: group 1 patients showed a 
more advanced cancer state with respect to group 2, as 
defined by the TNM score (p = 0.04, Mann Whitney U 
test; Figure 6A). 



Interestingly, all the novel somatic retrotransposition 
events identified in prostate cancer [26] belonged to the 
sub-families of Lis that displayed significant enrichment 
in our dataset (Figure 6A). In particular, 17 of them were 
from the human-specific LIHs subfamily. Hence, we ex- 
amined the LIHs elements more closely by mapping all 
RNA-seq reads to the LIHs consensus using Bowtie2 
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Figure 6 Primate-specific LI elements are overexpressed in a subclass of patients with more advanced tumor progression. 

(A) Clustering of log2 expression fold-changes in the subset of primate specific Lis that showed significant differential expression reveals two 
major classes of patients (Group 1 and Group 2). Group 1 shows widespread overexpression of primate specific Lis and contains patients with 
more advanced tumor progression. The number of somatic insertions refers to the number of previously reported somatic retrotransposition 
events for that L1 subfamily identified in prostate cancer [26]. (B) All L1 sequences in the human genome were fetched and mapped to LI Hs 
consensus using permissive, local alignment parameters to analyze data. Using this distribution we computed the cumulative distribution of start 
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consensus element. (C) Coverage of L1 sequences in prostate tumor versus normal RNA-seq that map to LI Hs consensus using a local alignment 
(Bowtie2). The log2FC was computed for each position along the L1Hs consensus from tumor and normal-matched RNA-seq coverage. 
Hierarchical clustering was done based on the log2FC using Euclidean metrics. 



local alignment mode. This method is not entirely specific 
to LIHs as closely homologous L1PA elements are also 
represented. The LIHs subfamily and its closely related 
primate-specific LI PA subfamilies are composed of gen- 
omic instances that are 3 ' biased as a consequence of a 5 ' 
truncation that frequently occurs during retrotransposi- 
tion [29] (Figure 6B, top panel). The fold-change in cover- 
age along the LIHs consensus between tumor and normal 
tissue was increased 2- to 4-fold across the entire length 
of the element, including the 5' UTR region in patient 
group 1. This is consistent with transcription of elements 



in the genome that are full length or close to full length. 
We also observed interesting and conserved patterns of 
fold-changes. For example, patients 1, 10 and 13 in group 
1 show dipping at 4 locations corresponding to LI ORF1 
and ORF2, while patient 11 in the same group displayed 
the opposite behavior. 

Many repetitive element insertions, including those of 
LI and Alu [30], are found in the introns of genes. The 
starting material for most RNA-seq libraries is poly-A 
purified total cellular RNA, which is predominantly ma- 
ture mRNA that is free of introns. However, a small 
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fraction of total cellular RNA is composed of pre-mRNA, 
also known as heterogeneous nuclear RNA (hnRNA), 
which also contains intronic sequences and can be polya- 
denylated. Hence, some of the reads assigned to repetitive 
elements could have originated from this small hnRNA 
pool To address this we examined separately the mapping 
of unique LIHs and LI PA reads to intronic and intergenic 
regions, and found very similar tumor-associated in- 
creases in the abundance corresponding to both regions 
(Additional file 1: Figure S15). Hence, the increased tran- 
scription of LI elements in prostate tumors appears to 
affect equivalently elements inserted outside of known 
genes, and those inserted within introns. 

Discussion 

The majority of the human genome is comprised of re- 
petitive sequences, most of which are represented by 
parasitic retrotransposon elements. Recent years have 
seen increased interest in understanding their regulation 
because of the important roles in genome evolution, de- 
velopment, and disease [31-35]. A prolific expansion of 
sequencing data, combined with new experimental and 
computational methods in genomics and transcriptomics, 
have spurred an extensive exploration of chromatin regu- 
lation, and the temporal and spatial organization of the 
RNA transcriptome. In spite of these new technologies, 
fundamental computational obstacles remain for the ana- 
lysis of repetitive elements in the short-read data produced 
by high-throughput sequencing. This is because short 
reads of repetitive elements align ambiguously and cannot 
be assigned to unique locations in the genome. 

We wanted to develop a computational pipeline to esti- 
mate enrichment and differential expression of repetitive 
elements in ChlP-seq and RNA-seq datasets. Because sig- 
nal from repetitive elements in many cases is likely to be 
weaker than from genes, as a consequence of their low 
level of activity, we favored a strategy that assigned reads 
to repetitive element subfamilies as opposed to individual 
instances. Previous work excluded reads that map to more 
than one repetitive element subfamily [13]. This approach 
can be problematic, because some individual elements are 
highly conserved. For example, multiple sequence align- 
ment of the consensus sequence for primate specific Lis 
reveals a high degree of homology between individual ele- 
ments, despite the fact that these consensus sequences rep- 
resent distinct repetitive element subfamilies (Additional 
file 1: Figure SI 6). Many multi-mapping reads tend to align 
with multiple repetitive element subfamilies (Additional 
file 1: Figure SI 7). Our tests of this counting strategy indi- 
cated that exclusion of reads mapping to more than one 
repetitive element subfamily would exclude 64% of 30 bp 
repetitive mapping reads and 51% of 50 bp repetitive map- 
ping reads (Additional file 1: Figure SI 7). Furthermore, re- 
quiring unambiguous assignment of reads to individual 



subfamilies will introduce a bias towards less conserved re- 
peats, which will be assigned relatively higher counts. 

To assess how to optimally count reads that map to 
more than one subfamily, we used in silico ChlP-seq data 
simulations where the true abundance of repetitive ele- 
ments was known. Double counting reads mapping to 
multiple subfamilies {total counting approach) tended to 
overestimate enrichment of Alu and SVA elements, while 
excluding those same reads (the unique counting method 
used by Day et al) introduced a similar bias but in the op- 
posite direction, as well as a larger variance in the count 
estimate (Figure 2A). These biases are likely a conse- 
quence of the high degree of sequence homology between 
subfamilies, and are particularly evident in the Alu and 
SVA families. Alu emerged relatively recently in primate 
evolution (-60 million years ago), and thus displays a high 
degree of sequence homology between subfamilies [36]. 
SVA elements are also highly homologous as they arose 
even more recently in hominid evolution [37]. A third 
counting strategy, based on assigning fractional values to 
each read mapping to multiple subfamilies (fractional 
counting approach), reduced both the bias and variance of 
the estimate. It most closely approximates the true abun- 
dance, and recovers more differentially enriched elements 
in both simulated and real data. Hence we selected frac- 
tional counting as the optimal strategy. 

Based on this analysis we developed a new computational 
pipeline, RepEnrich, for genome-wide studies of repetitive 
elements in ChlP-seq and RNA-seq high-throughput data. 
Our methodology extends existing strategies by utilizing all 
mappable reads in estimating read counts. RepEnrich is a 
flexible pipeline that can readily incorporate different se- 
quence aligners, multiple sequencing data types, and can 
easily interface with existing statistical packages for down- 
stream analysis. We demonstrate the utility of RepEnrich 
here by examining a large collection of high-throughput 
datasets to analyze transcriptional regulation of repetitive 
elements in multiple cell lines and human tissues. 

RepEnrich readily documented, in a genome-wide man- 
ner, several known aspects of the transcriptional activity of 
repetitive elements, especially small structural non-coding 
RNAs such as tRNAs, snRNAs, and rRNAs. As expected, 
tRNAs were predominantly transcribed by Pol III from a 
type II promoter and were predominantly enriched in 
the non-polyadenylated fraction and in the cytosol. The 
snRNAs were observed to be bound by Pol II, in agree- 
ment with their known transcriptional mechanism. One in- 
teresting observation was that many small structural non- 
coding RNAs, especially tRNAs, displayed co-occupancy of 
binding by Pol II and Pol III (Figure 3D). While the co- 
binding of Pol II and Pol III to small structural non-coding 
RNAs has been described previously at specific genomic 
locations [17], our results suggests such association occurs 
genome-wide. 
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Polymerase binding to small structural non-coding RNA 
elements was observed to be wide-spread across all the cell 
lines examined, which is consistent with their core roles 
in basic biological processes. Very low levels of polymerase 
enrichment were found at LINEs and DNA transposons, 
which is likely a consequence of their constitutive repres- 
sion by DNA methylation and heterochromatin silencing 
mechanisms [5]. SINEs and LTR elements showed signifi- 
cant polymerase binding that was typically restricted to one 
or a few of the cell lines examined (Figure 3 A, B and C). 
The LTR subfamilies were the most active retrotransposa- 
ble elements, with a general trend towards increased poly- 
merase binding in transformed cells (Figure 4D). 

Although LTR retrotransposons are thought to be mostly 
inactive in humans, and very few cases of novel germ-line 
and somatic retrotranspositions have been reported [5,26], 
our results are consistent with recent genome-wide studies 
of chromatin accessibility. Analysis of DNase I hypersensi- 
tive sites (DHS), markers of accessible chromatin, revealed 
many cell line-specific changes mapping to retrotrans- 
posable elements [38]. In particular, LTR retrotransposons 
displayed the majority of DHS changes, many of which 
correlated with changes in chromatin accessibility. Evi- 
dence has also emerged that LTR elements might function 
as enhancers [39,40]. Similarly, our results suggest that 
LTR retrotransposons are bound by RNA polymerases and 
are transcribed in a cell line-specific manner. 

Among the LTR elements with Pol II binding, the en- 
dogenous retrovirus HERV-Fcl displayed a large-degree of 
Pol II and Pol II S2 enrichment with the most prominent 
binding in a K562 CML line. Active Pol II transcription 
was also supported by RNA-seq enrichment in the polya- 
denylated fraction, as well as enrichment of several chro- 
matin activation marks in ChlP-seq data. Although the 
HERV super-family of retrotransposons is not thought to 
be active for retrotransposition, several of its members 
have been associated with multiple diseases. For example, 
increased transcription of HERV-K family members has 
been reported in amyotrophic lateral sclerosis (ALS) [25], 
CML [41], and recently in multiple sclerosis (MS) [42]. 
Seven HERV-Fcl elements are currently annotated, and 
we were able to identify a single genomic locus represent- 
ing the source of most of the ChlP-seq and RNA-seq sig- 
nal. Interestingly, at this locus we detected enrichment for 
binding of the TATA-box binding protein (TBP), as well 
as the MAFK, MAFF, and NFE2 transcription factors. The 
MAF family transcription factors contain mutations that 
are associated with CML [43] and heterodimerize with 
NFE2 [8]. These binding sites might be exposed due to 
loss of silencing at repetitive genomic regions in the K562 
cancer cell line, consistent with evidence that loss of DNA 
methylation can strongly activate HERV-Fcl [44]. 

One striking result of our analysis was that trans- 
formed cell lines consistently displayed a wider pattern 



of Pol II enrichment than normal cells (Figure 4D). A re- 
cent report on genome-wide changes in chromatin ac- 
cessibility in embryonic stem cells (ESC), differentiated 
cells, and cancer cells may shed some light on our obser- 
vations [45]. As ESCs differentiate into various cell types, 
the proportion of shared DHSs decreases, however, can- 
cer cells gain back many of the DHSs originally found in 
ESCs. It was suggested that cancer cells adopt a more 
accessible chromatin landscape, similar to ESCs. Al- 
though this particular study did not look specifically at 
retrotransposons, combining this model with our re- 
sults on Pol II binding in transformed cells suggests that 
genomic regions harboring transposable elements might 
be globally de-repressed and increase their transcrip- 
tional activity in cancer. 

To further examine the transcriptional activity of retro- 
transposons in cancer, we examined RNA-seq data from 
prostate tumors [21]. Many repetitive element families 
were differentially expressed in prostate tumors, with most 
of the changes occurring within LINE, LTR, and DNA 
class elements. LINE elements displayed a striking ten- 
dency to be upregulated in prostate tumors. A closer look 
at LI regulation revealed that patients could be separated 
into two groups based on their transcriptional profiles 
(Figure 6A). We found that patients in group 1 showed 
higher levels of LI expression in their tumors and, on aver- 
age, were diagnosed with a more advanced stage of cancer. 
Recently, novel somatic retrotransposition events have 
been identified in several different cancers, including ovar- 
ian, prostate, hepatocellular, and colon [26-28] . The major- 
ity of these new events involved evolutionarily recent 
human-specific LIHs, primate-specific LI PA and Alu ele- 
ments. For prostate cancer, 26 out of 28 new retrotranspo- 
sitions identified [26] belonged to the LIHs and L1PA 
families that were also significantly upregulated in our ana- 
lysis. Because only full-length elements are competent for 
retrotransposition and the majority of LIHs elements in 
the genome are 5' truncated, we further studied changes 
in read coverage along the entire consensus LIHs se- 
quence. We found that tumors of group 1 patients showed 
a 2-fold (or greater) increase in read coverage and that 
read coverage was elevated equivalently across the entire 
element including the 5' end. This suggests that the in- 
crease in transcription involved predominantly full-length 
elements and was initiated at the LI promoter. 

Conclusions 

In summary, our study underscores the richness of in- 
formation on the transcriptional regulation of repetitive 
elements, and transposable elements in particular, con- 
tained in publically available, high throughput sequen- 
cing datasets. Because the amount of this information is 
expected to vastly increase in the near future, dedicated 
computational pipelines, such as RepEnrich, will be of 
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great utility in mining these datasets. RepEnrich allows for 
the analysis of repetitive elements in any organism with a 
reference genome available that has repetitive element an- 
notation (such as Repeatmasker annotation). RepEnrich 
also allows for a custom repetitive element annotation, 
which can be used for a variety of applications where 
multi-mapping reads become an issue such as gene clus- 
ters repeats that appear in tandem duplicates. 

Our study also supports the importance of activation of 
endogenous retrotransposons as an important, and prob- 
ably universal, feature of cancer. Whether retrotranspo- 
sable elements are drivers or passengers of the cancer 
development process is still an open question and will re- 
quire further investigation. In addition, we suggest that 
they will have considerable utility as biomarkers, and in 
combination with other genomic features, will help in elu- 
cidating cancer subtypes, progression and prognosis. 

Methods 

Analysis of repetitive element enrichment using RepEnrich 

Sample reads were aligned to the genome using Bowtiel 
with the requirement that reads map uniquely, command = 
bowtie hgl9 -p 16 -t -m 1 -S -chunkmbs 512 -max multi- 
map.fastq inputfastq outputsam [46]. Reads mapping to 
multiple locations of the genome were assigned to a separ- 
ate FASTQ file (i.e. -max). Annotation was constructed 
from RepeatMasker annotated genomic instances of repeti- 
tive elements (downloaded from Repeatmasker.org). The 
genomic coordinates of repetitive elements were used to 
build repetitive element psuedogenome assemblies for each 
distinct repetitive element subfamilies. Repetitive element 
psuedogenome assemblies were built by concatenating gen- 
omic instances of each repetitive element subfamily, their 
flanking genomic sequences (default = 15 bp), and a spacer 
sequence (default = 200 bp) in FASTA format, in a manner 
similar to Day et al [13]. These psuedogenomes were 
indexed using Bowtie. A genomic feature file was also built 
in BED format, which describes the coordinates of all anno- 
tated repetitive element instances. The genomic feature files 
in BED format and the distinct repetitive element psue- 
dogenome assemblies in FASTA format were used to se- 
parately analyze the unique mapping reads and the reads 
mapping to more than one location. Reads mapping to 
unique genomic positions were sorted based on overlap 
with repetitive element genomic instances. To conduct the 
overlap we used Bedtools to intersect the alignment file and 
the genomic instances of repetitive elements [47]. Reads 
that map to more than one location are categorically 
aligned to the repetitive psuedogenome assemblies using 
Bowtie. For paired-end reads, each mate pair is separately 
mapped to the repetitive psuedogenome assemblies. RepEn- 
rich systematically tracks all repetitive element subfamilies 
a given read aligns for all reads. We can determine the 
number of reads mapping to repetitive element subfamilies, 



repetitive element families, or repetitive element classes. 
RepEnrich uses three separate ways of classifying the reads 
that map to multiple repetitive element subfamilies: total 
counts, unique counts, and fractional counts. The total 
counts output sums all reads that map to an individual re- 
petitive element subfamily. The unique counts output sums 
only reads that can be uniquely assigned to a single repeti- 
tive element subfamily, similar to the output of Day et. al. 
[13]. The fractional counts sums reads mapping uniquely to 
a repetitive element subfamily once and counts reads map- 
ping to multiple subfamilies using a fraction 1/N S , where 
N s = number of repetitive element subfamilies the read 
aligns with. The fractional count rounds the estimate for a 
subfamily to the nearest integer and is the default method 
used by RepEnrich. 

Availability 

The RepEnrich tutorial and source code is available for 
download at our github repository https://github.com/ 
nerettilab/RepEnrich. RepEnrich supports analysis for 
ChlP-seq and RNA-seq for any organism where a refer- 
ence genome and repetitive element annotation (such as 
Repeatmasker annotation) is available. RepEnrich also 
supports custom repetitive element or repeat feature an- 
notation in bed format. 

Simulation of ChlP-seq datasets 

To conduct the ChlP-seq simulation we developed a hid- 
den Markov model (HMM) that simulates separate states 
for different genomic features over the length of a chromo- 
some. The strategy is similar to approaches used for previ- 
ous studies addressing ChlP-seq simulation, however, we 
extended these methods to cover an entire chromosome 
and to use underlying information about genomic features 
[22]. The output for the HMM is the probability that a 
read is selected from a given genomic position in a ChlP- 
seq experiment. This probability is derived from the emis- 
sion state profile generated by the HMM. The transition 
matrix for the HMM simulates whether a given base pair 
along the length of the chromosome is in a high or low 
emission state. The simulator was built such that differen- 
tial enrichment profiles could be generated by defining the 
coordinates of repetitive elements, or other genomic fea- 
tures. To simulate enrichment over a repetitive element, 
we specified a transition state probability matrix that 
yielded more frequent occupancy of the high emission 
state for their coordinates. The output for the simula- 
tion is the true start positions of all the simulated reads. 
We then generated reads from the start positions in 
FASTA format. 

We used the ChlP-seq simulation to evaluate the pre- 
dictive power of RepEnrich. To test the repetitive element 
analysis we simulated ChlP-seq data on human chromo- 
somes 5, 10, and 19 (see Additional file 1: Figure S2 for 
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HMM parameters). For the simulation of separate chro- 
mosomes we used only RepeatMasker genomic instances 
present on human chromosomes we examined (build 
hgl9). We simulated ChlP-seq data for three experimental 
comparisons and six experimental conditions. We exam- 
ined conditions where LI, Alu, and SVA family retro tran- 
sposons were enriched and conditions where the LI, Alu, 
and SVA family retrotransposons were near background, 
considered an input. Each condition was simulated in trip- 
licate with a parameter to introduce technical variance. 
For chromosome 19 we simulated a situation with high se- 
quencing depth (twenty million reads) at three read 
lengths (30, 50, and 100 base pairs). For chromosomes 5 
and 10 we simulated a situation with lower sequencing 
depth (two million reads) at three read lengths (35, 50, 
and 75 base pairs). Simulated reads were aligned uniquely 
to human chromosome 19 and reads mapping to multiple 
locations were output to a separate FASTA file. Repetitive 
element enrichment was determined by RepEnrich. The 
expected abundance of repetitive element enrichment was 
determined for the various conditions using the true pos- 
ition of the simulated reads. The simulated positions of the 
reads were also used to generate the true alignment file, in 
bam format, as if all the multi-mapping reads had mapped 
uniquely. Using the true positions the expected count for 
each repetitive element subfamily was determined by over- 
lapping the reads with the genomic coordinates of each re- 
petitive element subfamily using Bedtools [47]. 

Using the read counts determined by RepEnrich, frac- 
tional, unique, and total counting methods and the ex- 
pected count we calculated the normalized read abundance 
or CPM and conducted differential enrichment analysis. 
To do so, the various count estimates generated by 
RepEnrich were analyzed using EdgeR bioconductor pack- 
age for statistically significant enrichment of repetitive ele- 
ments in simulated ChlP-seq conditions [48]. EdgeR uses 
a generalized linear model (GLM) to identify differential 
enrichment by fitting the genomic count data to a negative 
binomial distribution. Recent work extends the use of 
EdgeR from RNA-seq analysis of differential expression to 
diverse types of genomic count data arising from ChlP- 
seq experiments [49]. The data were first normalized using 
trimmed mean of M- values (TMM) normalization method 
and manually inputted total mapping reads [50]. Using 
Edger built-in functions we could then compute the nor- 
malized read abundance. EdgeR was then used to make a 
pooled comparison LI, Alu, SVA enriched samples versus 
input samples, where LI, Alu, SVA were at background 
levels (see EdgeR tutorial for pooled comparisons). EdgeR 
analysis yielded the log2 fold changes for ChIP with re- 
spect to input and an associated p-value for each repetitive 
element subfamily. The p values were corrected using an 
FDR correction using the method described by Storey 
et al. [51]. 



Analysis of ENCODE ChlP-seq datasets for enrichment to 
repetitive elements 

Raw data for RNA Polymerase ChlP-seq experiments was 
downloaded in FASTQ from the ENCODE data consor- 
tium or the European Nucleotide Archive (for complete list 
see Additional file 2: Table SI) [14-17,24,52]. TFIIIB factor 
components Bdpl, Brfl, Brf2, and SNAP45 ChlP-seq data 
was obtained from ENCODE and published datasets 
[15-17]. K562 ChlP-seq data for active and repressed chro- 
matin marks was downloaded from ENCODE data con- 
sortium [53]. ChlP-seq and input samples were mapped 
uniquely to the genome (build hgl9) using Bowtiel short 
read aligner [46]. Repetitive element analysis was con- 
ducted as described above using RepEnrich software. The 
fractional count output of RepEnrich was used for analysis 
of RNA Pol II and III ChlP-seq data. The raw fractional 
counts generated by RepEnrich for RNA polymerases in 
human cell lines was analyzed using EdgeR bioconductor 
package for statistically significant enrichment of repetitive 
elements in ChlP-seq samples with respect to input [48]. 
The data was first normalized using TMM normalization 
method and manually inputted total mapping reads [50]. 
EdgeR was then used to make a pooled comparison be- 
tween RNA Polymerases ChlP-seq versus input using cell 
line as an independent factor. EdgeR analysis yielded the 
log2 fold changes for ChIP with respect to input and an 
associated FDR value. 

Detecting transcripts from repetitive elements in ENCODE 
RNA-seq experiments 

The RepEnrich method was extended to the analysis of re- 
petitive element reads present in RNA-seq data. Three cell 
lines were chosen to complement the analysis of RNA 
polymerases and TFIIIB subunits: GM12878, HeLa, and 
K562 cells. The RNA-seq data for GM12878, HeLa, and 
K562 cells was generated as part of the ENCODE project 
[18-20,54]. The data includes three sub-cellular compart- 
ments including total RNA, cytosol, and nucleus. For each 
cellular compartment we examined PolyA selected and 
non-PolyA selected RNAs using duplicate samples. The 
GM12878, Hela, and K562 cells were sequenced using 75 
base pair paired-end reads. The analysis serves as an ex- 
ample of how RepEnrich can also be applied to paired-end 
data. Reads for all samples were trimmed to 50 base pair 
paired-end, to avoid inconsistency in sequencing quality 
present at 3' distal end of reads from different samples. 
All reads from each RNA-seq sample were mapped 
uniquely to the human genome (build hgl9) using Bow- 
tiel. We used Bowtiel for the analysis of RNA-seq because 
repetitive element reads that map specifically to a splice 
junction may be unreliable and highly ambiguous. By using 
Bowtiel rather than Tophat we simply excluded splice- 
junction reads from our analysis. The alignments were an- 
alyzed using RepEnrich and the fractional count output. 
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Downstream analysis was similar to the analysis of ChlP- 
seq data using EdgeR, with two key differences. First the 
manually inputted library sizes were obtained by calcu- 
lating the total mapping reads of STAR alignment BAM 
files available through ENCODE data consortium using 
samtools [48,55,56]. To identify significant differences in 
subcellular compartments we built a GLM in EdgeR and 
conducted comparisons within K562, HeLa, and GM12878 
cell lines between the various compartments (all compari- 
sons described in Additional file 2: Table SI). We decided 
to treat cell line as a separate factor instead of a covariate 
due to improved performance of the edgeR GLM model, 
although both approaches yielded similar results. 

Differential RNA expression analysis of repetitive element 
subfamilies in prostate cancer 

RNA-seq data from 14 prostate tumors and paired nor- 
mal tissue was analyzed as follows [21]. The 90 bp 
paired-end RNA-seq reads were mapped uniquely to the 
human genome (build hgl9) using Bowtiel. RepEnrich 
fractional counts were analyzed using EdgeR as was 
done for ENCODE RNA-seq data. The published total 
mapping reads for the study were inputted to EdgeR. To 
identify repetitive element subfamilies with significant 
differences in tumor versus control we built a paired 
GLM in EdgeR using individual as a covariate. The FDR 
corrected significance values were obtained for the com- 
parison between tumor and normal tissue. In addition, 
we also calculated the log2 fold change for each individ- 
ual tumor vs. normal matched tissue using the normal- 
ized count values. 

Visualizing coverage along a single repetitive element 
subfamily consensus 

To better examine coverage of repetitive element subfam- 
ilies along the full length of the elements we built RepCon- 
sensus, an extension of previous efforts to characterize 
read coverage with respect to a consensus element with 
added visualization tools [12]. RepConsensus is a package 
independent of RepEnrich that can be used to visualize 
coverage of reads along a consensus element. Alignment 
parameters needed to be more relaxed such that reads 
containing SNPs can still map to the consensus element 
and reads that contain adjacent non-repetitive genomic se- 
quence may also map. Consensus elements were down- 
loaded from RepBase.org, including the human-specific LI 
element LIHs. To align reads to the LIHs consensus we 
used Bowtie2 local alignment mode (bowtie2 -no-unal -p 
16 -N 1 -local -x LIHs -1 pairl.fastq -2 pair2.fastq -S 
out.sam). Local alignment mode can soft-clip the reads to 
allow alignment, which helps align reads that may contain 
adjacent non-repetitive genomic sequence. The -N 1 op- 
tion allows for up to one mismatch in the seed sequence, 
which aids in the mapping of reads containing SNPs 



different from the consensus. We also build the back- 
ground distribution of LI family element genomic in- 
stances that map to the LIHs subfamily consensus using 
these parameters. This is done to understand the degree 
with which other highly related subfamilies (such as evolu- 
tionarily recent primate-specific LI PA subfamilies) also 
map to the LIHs consensus. In addition, we can determine 
the background distribution of LI element lengths in the 
genome. We map all the LI family genomic instances to 
LIHs using the same parameters. Then we calculate the 
cumulative distribution of LI genomic instances start and 
end sites with respect to the length of the LIHs element. 
This reveals a preponderance of 5' truncated elements 
consistent with what is known about LI insertions, how- 
ever, few elements contain 3' truncations [57]. The infor- 
mation regarding the start and end sites along the LIHs 
consensus is important when interpreting RNA-seq align- 
ment to the consensus. To do the analysis of RNA-seq 
data for prostate cancer, we mapped all the data for tumor 
and normal matched control to the LIHs consensus. Then 
we computed the coverage along the LIHs consensus 
using bedtools. The data were normalized by the total 
mapping reads and the paired calculation of log2 fold 
change was computed along the length of LIHs consensus 
for each individual tumor. 

Investigating the genie vs. intergenic contribution of LIHs 
and LI PA RNA-seq transcripts in prostate cancer 

To approximate the genie and intergenic contribution of 
transcripts we examined the reads that mapped uniquely to 
the genome. We defined LIHs and L1PA coordinates that 
overlapped 99% within gene bodies and 99% overlapping 
with intergenic regions using bedtools and Refseq hgl9 
gene annotations. Next we computed the coverage for 
genie LIHs and LI PA elements and intergenic LIHs and 
LI PA elements using bedtools. We summed the coverage 
for genie LIHs and LI PA elements and intergenic LIHs 
and LI PA elements and then computed the counts per mil- 
lion for these two values without TMM normalization 
using the total mapping reads. Finally for the paired tumor 
and normal matched control we computed the log2FC for 
tumor vs. normal from the normalized log2CPM values. 

Availability of additional files 

All data presented in this study was previously published 
and is publicly available. For detailed summary of samples 
used see Additional file 2: Table SI. The data is available 
online through the ENCODE consortium (http://genome. 
ucsc.edu/ENCODE/). Published datasets are available 
through the NCBI Gene Expression Omnibus. Oler, A.J. 
et al. [24] accession number: GSE20309, Canella, D. 
et al. [15] accession number: GSE18184, and through 
the European Nucleotide Archive Ren S. et al. [21] ac- 
cession number: ERP000550. 
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Additional files 



Additional file 1: Figure SI. RepEnrich read counting strategies. 

Figure S2. HMM parameters used in the ChlP-seq simulations. 
Figure S3. Genome browser view of simulated data. Figure S4. Comparison 
of counting strategies performance on Alu enriched simulated ChlP-seq data 
for human chromosome 19. Figure S5. Comparison of counting strategy 
performance over a wide-range of parameters for human chromosome 5. 
Figure S6. Comparison of counting strategy performance over a wide- 
range of parameters for human chromosome 10. Figure S7. Comparison of 
counting strategy differential enrichment analysis predictions for ChlP-seq 
data simulations over human chromosome 5. Figure S8. Comparison of 
counting strategy differential enrichment analysis predictions for ChlP-seq 
data simulations over human chromosome 10. Figure S9. Comparison of 
counting strategy differential enrichment analysis predictions for real 
ChlP-seq data. Figure S10. Representative genome browser view of 
ENCODE enrichment tracks. Figure S1 1 . Pol III promoter-type assignment. 
Figure SI 2. ENCODE RNA Polymerases and differential RNA-seq analysis of 
SINE, tRNA, and Satellite class elements. Figure SI 3. ENCODE RNA 
Polymerases and differential RNA-seq analysis of DNA, LINE, and LTR class 
elements. Figure S14. Summary of RNA polymerase II enrichment to LTR 
retrotransposons in ENCODE cell-lines. Figure S16. Example of homology 
between repetitive element L1PA subfamilies. Figure SI 7. Effect of read 
length on repetitive element subfamily read assignment. 

Additional file 2: Table SI. Description of publically available datasets 
used in this study. Set 1-7 refers to how the datasets were grouped into 
separate GLM models that were used in our analysis (see Methods). 
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